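$$
(n, m) = \underset{i,\,j}{\arg\min}\; (\mathbf{x}_i - \mathbf{x}_j)^2 \quad \forall\, \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}
\tag{2.18}
$$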
Here, argmin stands for minimising the expression $(\mathbf{x}_i - \mathbf{x}_j)^2$ by
varying its arguments, which are $i$ and $j$. The calculation returns
the two indexes, $n$ and $m$, of the two data points $\mathbf{x}_n$ and $\mathbf{x}_m$
for which the distance between them is the least. The notation ∀ means ‘for all’.
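The search for the closest pair can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are taken to be plain numbers, the helper name `closest_pair` is hypothetical, and only pairs with $i < j$ are compared so that a point is never paired with itself.

```python
from itertools import combinations


def closest_pair(points):
    """Return the indexes (n, m), with n < m, of the two points whose
    squared difference is the smallest.

    Assumption: points are scalars. Ties are broken in favour of the
    first pair encountered.
    """
    return min(
        combinations(range(len(points)), 2),
        key=lambda pair: (points[pair[0]] - points[pair[1]]) ** 2,
    )


data = [1.0, 4.0, 2.0, 7.0]
n, m = closest_pair(data)
print(n, m, data[n], data[m])  # prints: 0 2 1.0 2.0
```

For data points with several components, the key function would instead sum the squared differences over the components.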
Suppose two data points ($\mathbf{x}_n$ and $\mathbf{x}_m$) satisfy the above condition and have
been selected. They are merged and removed from $\mathcal{D}$. In addition, the
median (or the mean) of them is inserted into $\mathcal{D}$. This mean or median is
called a meta data point. The operation for every merge is shown below,
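$$
\begin{aligned}
\mathcal{D} &= \mathcal{D} \setminus (\mathbf{x}_n, \mathbf{x}_m) \\
\mathcal{D} &= \mathcal{D} \cup \omega
\end{aligned}
\tag{2.19}
$$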
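Here, $\setminus$ is an operator called set minus for removing data from a set, and $\cup$ is
an operator called set union for adding new data into a set. $\omega$ is the
meta data point, the mean of $\mathbf{x}_n$ and $\mathbf{x}_m$. The size of $\mathcal{D}$ is reduced by one after
each merge. For instance, if $\mathcal{D} = (1, 2, 3, 4)$, $\mathcal{D} \setminus (2, 3)$ removes 2 and 3
from $\mathcal{D}$, leading to a new set $\mathcal{D} = (1, 4)$. Moreover, $\mathcal{D} = \mathcal{D} \cup 2.5$ adds 2.5
to $\mathcal{D}$, resulting in $\mathcal{D} = (1, 4, 2.5)$. Note that 2.5 is the meta data point $\omega_{(2,3)}$
of the data points 2 and 3. This process continues until $\mathcal{D}$ contains only
one meta data point.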
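The merge step and its repetition can be sketched as follows. This is a minimal illustration, not the book's implementation: the function names `merge_pair` and `agglomerate` are hypothetical, the data points are assumed to be plain numbers, the mean is used as the meta data point, and ties between equally close pairs are broken by taking the first pair found.

```python
from itertools import combinations


def merge_pair(data, n, m):
    """One merge: remove data[n] and data[m] (set minus), then append
    their mean as the meta data point (set union)."""
    omega = (data[n] + data[m]) / 2
    rest = [x for k, x in enumerate(data) if k not in (n, m)]
    return rest + [omega]


def agglomerate(data):
    """Repeat the merge on the closest pair until one point remains.
    Returns the final list and the sequence of merged pairs."""
    data = list(data)
    history = []
    while len(data) > 1:
        # Closest pair by squared difference, comparing only i < j.
        n, m = min(
            combinations(range(len(data)), 2),
            key=lambda p: (data[p[0]] - data[p[1]]) ** 2,
        )
        history.append((data[n], data[m]))
        data = merge_pair(data, n, m)  # size shrinks by one per merge
    return data, history


# The worked example from the text: merging 2 and 3 in (1, 2, 3, 4).
print(merge_pair([1, 2, 3, 4], 1, 2))  # prints: [1, 4, 2.5]

final, history = agglomerate([1.0, 2.0, 4.0, 8.0])
print(history)  # prints: [(1.0, 2.0), (4.0, 1.5), (8.0, 2.75)]
print(final)    # prints: [5.375]
```

A set of four points needs three merges, which matches the statement that each merge reduces the size of the set by one.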
In order to show how the hierarchical clustering algorithm works, the
20 amino acids are used as an example. To study protein sequence data,
it is required to use numerical data to encode the amino acids. This is
because most machine learning algorithms only accept numerical
data as the input. There has been a long history of investigating different
descriptors to encode the amino acids as numerical data [Kidera, et al.,
…; …in, et al., 2007; Lin, et al., 2008; Fontaine, et al., 2019]. Table 2.2
shows one descriptor system for the 20 amino acids, by which each amino
acid is encoded by three descriptors [Lin, et al., 2008].